Case Study: US Craft Beer and Brewery

Hao Wang, Alex Salamah

October 2019

Introduction

In this case study, we analyzed two data sets provided by our client Budweiser, the Beers dataset contains a list of 2410 US craft beers and Breweries dataset contains 558 US breweries. The purpose of the code is to find statistical characters of IBU and ABV data, also the correlation between them to address questions from Budweriser.

Repository Structure

Case study Repository in GitHub: https://github.com/48120778/EDAforProject1.git Files: Reers.csv: Source beer data. Breweries.csv: Source breweries data. EDAforProject1_R3.RMD: R Markdown file with all R code. EDAforProject1_R3.HTML: HTML file created with Knit.

Analysis

1. How many breweries are present in each state?

Number of breweries in 51 states has been calculated and provided as the below table, together with a plot with numbers in order.

## 
##  AK  AL  AR  AZ  CA  CO  CT  DC  DE  FL  GA  HI  IA  ID  IL  IN  KS  KY 
##   7   3   2  11  39  47   8   1   2  15   7   4   5   5  18  22   3   4 
##  LA  MA  MD  ME  MI  MN  MO  MS  MT  NC  ND  NE  NH  NJ  NM  NV  NY  OH 
##   5  23   7   9  32  12   9   2   9  19   1   5   3   3   4   2  16  15 
##  OK  OR  PA  RI  SC  SD  TN  TX  UT  VA  VT  WA  WI  WV  WY 
##   6  29  25   5   4   1   3  28   4  16  10  23  20   1   4

2. Merge beer data with the breweries data. Print the first 6 observations and the last six observations to check the merged file.

We merged the two data set into one and named it as ‘mergeDF’, the primary key being used is ‘Brewery_id’ from Beer data set, and ‘Brew_ID’ from Breweries data set. We also changed the two columns’ name for clear understanding. The first and last 6 observations were showed there with head/tail command.

## [1] "The first 6 observations of the merged file:"
##   Brewery_id     Beer_name Beer_ID   ABV IBU
## 1          1  Get Together    2692 0.045  50
## 2          1 Maggie's Leap    2691 0.049  26
## 3          1    Wall's End    2690 0.048  19
## 4          1       Pumpion    2689 0.060  38
## 5          1    Stronghold    2688 0.060  25
## 6          1   Parapet ESB    2687 0.056  47
##                                 Style Ounces       Brewery_name
## 1                        American IPA     16 NorthGate Brewing 
## 2                  Milk / Sweet Stout     16 NorthGate Brewing 
## 3                   English Brown Ale     16 NorthGate Brewing 
## 4                         Pumpkin Ale     16 NorthGate Brewing 
## 5                     American Porter     16 NorthGate Brewing 
## 6 Extra Special / Strong Bitter (ESB)     16 NorthGate Brewing 
##          City State
## 1 Minneapolis    MN
## 2 Minneapolis    MN
## 3 Minneapolis    MN
## 4 Minneapolis    MN
## 5 Minneapolis    MN
## 6 Minneapolis    MN
## [1] "The last 6 observations of the merged file:"
##      Brewery_id                 Beer_name Beer_ID   ABV IBU
## 2405        556             Pilsner Ukiah      98 0.055  NA
## 2406        557  Heinnieweisse Weissebier      52 0.049  NA
## 2407        557           Snapperhead IPA      51 0.068  NA
## 2408        557         Moo Thunder Stout      50 0.049  NA
## 2409        557         Porkslap Pale Ale      49 0.043  NA
## 2410        558 Urban Wilderness Pale Ale      30 0.049  NA
##                        Style Ounces                  Brewery_name
## 2405         German Pilsener     12         Ukiah Brewing Company
## 2406              Hefeweizen     12       Butternuts Beer and Ale
## 2407            American IPA     12       Butternuts Beer and Ale
## 2408      Milk / Sweet Stout     12       Butternuts Beer and Ale
## 2409 American Pale Ale (APA)     12       Butternuts Beer and Ale
## 2410        English Pale Ale     12 Sleeping Lady Brewing Company
##               City State
## 2405         Ukiah    CA
## 2406 Garrattsville    NY
## 2407 Garrattsville    NY
## 2408 Garrattsville    NY
## 2409 Garrattsville    NY
## 2410     Anchorage    AK

3. Address the missing values in each column.

The missing values can only be found in column IBU (1005 NA’s) and column ABV (62 NA’s), refer to red bars in the first plot. As we don’t want to delete all the missing values, we started with Option 1 to deal with the missing values, in which the state level median values were calculated for both IBU and ABV and used to replace missing values respectively, but we found that the correlation between the IBU and ABV would be heavily impacted due to this replacement.

Option 1: Simply replacing missing values in IBU/ABV with state median results in significant drop on correlation, from 67% to 49% (Refer to #7)

##  AK  AL  AR  AZ  CA  CO  CT  DC  DE  FL  GA  HI  IA  ID  IL  IN  KS  KY 
##   8   1   4  23  48 119  21   4   1  21   9   9   5  13  52  48   4   7 
##  LA  MA  MD  ME  MI  MN  MO  MS  MT  NC  ND  NE  NH  NJ  NM  NV  NY  OH 
##   9  31  11  20 124   9  13   0  17  29   0  16   6   0   8   3  28  17 
##  OK  OR  PA  RI  SC  SD  TN  TX  UT  VA  VT  WA  WI  WV  WY 
##   8  38  53   7   9   7   1  41  15   5  10  25  45   0   3
##  AK  AL  AR  AZ  CA  CO  CT  DC  DE  FL  GA  HI  IA  ID  IL  IN  KS  KY 
##  25  10   5  47 183 265  27   8   2  58  16  27  30  30  91 139  23  21 
##  LA  MA  MD  ME  MI  MN  MO  MS  MT  NC  ND  NE  NH  NJ  NM  NV  NY  OH 
##  19  82  21  27 162  55  42  11  40  59   3  25   8   8  14  11  74  49 
##  OK  OR  PA  RI  SC  SD  TN  TX  UT  VA  VT  WA  WI  WV  WY 
##  19 125 100  27  14   7   6 130  26  40  27  68  87   2  15
##  AK  AL  AR  AZ  CA  CO  CT  DC  DE  FL  GA  HI  IA  ID  IL  IN  KS  KY 
##   0   0   0   3   1  15   0   0   1   2   0   0   0   0   0   2   0   1 
##  LA  MA  MD  ME  MI  MN  MO  MS  MT  NC  ND  NE  NH  NJ  NM  NV  NY  OH 
##   0   0   0   0  11   0   3   0   1   4   0   4   0   0   1   1   1   0 
##  OK  OR  PA  RI  SC  SD  TN  TX  UT  VA  VT  WA  WI  WV  WY 
##   0   0   3   0   0   0   0   6   0   0   0   0   2   0   0

##    Brewery_id                     Beer_name       Beer_ID      
##  Min.   :  1.0   Nonstop Hef Hop       :  12   Min.   :   1.0  
##  1st Qu.: 94.0   Dale's Pale Ale       :   6   1st Qu.: 808.2  
##  Median :206.0   Oktoberfest           :   6   Median :1453.5  
##  Mean   :232.7   Longboard Island Lager:   4   Mean   :1431.1  
##  3rd Qu.:367.0   1327 Pod's ESB        :   3   3rd Qu.:2075.8  
##  Max.   :558.0   Boston Lager          :   3   Max.   :2692.0  
##                  (Other)               :2376                   
##       ABV               IBU                                    Style     
##  Min.   :0.00100   Min.   :  4.00   American IPA                  : 424  
##  1st Qu.:0.05000   1st Qu.: 26.00   American Pale Ale (APA)       : 245  
##  Median :0.05700   Median : 35.00   American Amber / Red Ale      : 133  
##  Mean   :0.05973   Mean   : 39.81   American Blonde Ale           : 108  
##  3rd Qu.:0.06700   3rd Qu.: 47.00   American Double / Imperial IPA: 105  
##  Max.   :0.12800   Max.   :138.00   American Pale Wheat Ale       :  97  
##                                     (Other)                       :1298  
##      Ounces                          Brewery_name            City     
##  Min.   : 8.40   Brewery Vivant            :  62   Grand Rapids:  66  
##  1st Qu.:12.00   Oskar Blues Brewery       :  46   Portland    :  64  
##  Median :12.00   Sun King Brewing Company  :  38   Chicago     :  55  
##  Mean   :13.59   Cigar City Brewing Company:  25   Indianapolis:  43  
##  3rd Qu.:16.00   Sixpoint Craft Ales       :  24   San Diego   :  42  
##  Max.   :32.00   Hopworks Urban Brewery    :  23   Boulder     :  41  
##                  (Other)                   :2192   (Other)     :2099  
##      State       IBUmedian           ABVmedian      
##   CO    : 265   Length:2410        Min.   :0.04000  
##   CA    : 183   Class :character   1st Qu.:0.05500  
##   MI    : 162   Mode  :character   Median :0.05700  
##   IN    : 139                      Mean   :0.05677  
##   TX    : 130                      3rd Qu.:0.05800  
##   OR    : 125                      Max.   :0.06250  
##  (Other):1406

Option 2: State level median values were only used to replace ABV missing values (62 NA’s in ABV), then we predicted the IBU values (1005 NA’s) with a Simple Linear Regression model base on the exporatory variable ABV, this way we keep the correlation between IBU/ABV no big change. This is the option that we used to get the data frame ‘tidyDF’, which is used to address the following questions.

## 
## Call:
## lm(formula = mergeDF_tidy$IBU ~ mergeDF_tidy$ABV)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -78.849 -11.977  -0.721  13.997  93.458 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -34.099      2.326  -14.66   <2e-16 ***
## mergeDF_tidy$ABV 1282.037     37.860   33.86   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.26 on 1403 degrees of freedom
## Multiple R-squared:  0.4497, Adjusted R-squared:  0.4493 
## F-statistic:  1147 on 1 and 1403 DF,  p-value: < 2.2e-16
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    4.00   21.00   35.00   42.71   64.00  138.00    1005
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   26.00   36.41   42.49   55.64  138.00

4. Compute the median alcohol content(ABV) and international bitterness unit (IBU) for each state. Plot a bar chart to compare.

Median ABV and IBU data have been sorted and plotted.

5. Which state has the maximum alcoholic (ABV) beer? Which state has the most bitter (IBU) beer?

Maximum ABV and IBU data have been sorted and plotted.

6. Comment on the summary statistics and distribution of the ABV variable.

Statistics of ABV variable can be found with summary function, boxplot, barchart and scatter plots have been provided.

##    Brewery_id                     Beer_name       Beer_ID      
##  Min.   :  1.0   Nonstop Hef Hop       :  12   Min.   :   1.0  
##  1st Qu.: 94.0   Dale's Pale Ale       :   6   1st Qu.: 808.2  
##  Median :206.0   Oktoberfest           :   6   Median :1453.5  
##  Mean   :232.7   Longboard Island Lager:   4   Mean   :1431.1  
##  3rd Qu.:367.0   1327 Pod's ESB        :   3   3rd Qu.:2075.8  
##  Max.   :558.0   Boston Lager          :   3   Max.   :2692.0  
##                  (Other)               :2376                   
##       ABV               IBU                                    Style     
##  Min.   :0.00100   Min.   :  0.00   American IPA                  : 424  
##  1st Qu.:0.05000   1st Qu.: 26.00   American Pale Ale (APA)       : 245  
##  Median :0.05700   Median : 36.41   American Amber / Red Ale      : 133  
##  Mean   :0.05973   Mean   : 42.49   American Blonde Ale           : 108  
##  3rd Qu.:0.06700   3rd Qu.: 55.64   American Double / Imperial IPA: 105  
##  Max.   :0.12800   Max.   :138.00   American Pale Wheat Ale       :  97  
##                                     (Other)                       :1298  
##      Ounces                          Brewery_name            City     
##  Min.   : 8.40   Brewery Vivant            :  62   Grand Rapids:  66  
##  1st Qu.:12.00   Oskar Blues Brewery       :  46   Portland    :  64  
##  Median :12.00   Sun King Brewing Company  :  38   Chicago     :  55  
##  Mean   :13.59   Cigar City Brewing Company:  25   Indianapolis:  43  
##  3rd Qu.:16.00   Sixpoint Craft Ales       :  24   San Diego   :  42  
##  Max.   :32.00   Hopworks Urban Brewery    :  23   Boulder     :  41  
##                  (Other)                   :2192   (Other)     :2099  
##      State        ABVmedian        predict_IBU    
##   CO    : 265   Min.   :0.04000   Min.   :  0.00  
##   CA    : 183   1st Qu.:0.05500   1st Qu.: 30.00  
##   MI    : 162   Median :0.05700   Median : 38.98  
##   IN    : 139   Mean   :0.05677   Mean   : 42.49  
##   TX    : 130   3rd Qu.:0.05800   3rd Qu.: 51.80  
##   OR    : 125   Max.   :0.06250   Max.   :130.00  
##  (Other):1406

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 5 rows containing non-finite values (stat_count).
## Warning: Removed 1 rows containing missing values (geom_bar).

## Warning: Removed 5 rows containing non-finite values (stat_count).

## Warning: Removed 1 rows containing missing values (geom_bar).

## Warning: Removed 5 rows containing non-finite values (stat_count).
## Warning: position_stack requires non-overlapping x intervals
## Warning: Removed 3 rows containing missing values (geom_bar).

## Warning: position_stack requires non-overlapping x intervals

## Warning: position_stack requires non-overlapping x intervals

## Warning: position_stack requires non-overlapping x intervals

7. Is there an apparent relationship between the bitterness of the beer and its alcoholic content? Draw a scatter plot. Make your best judgment of a relationship and EXPLAIN your answer.

Scatter plots with linear model smooth function are provided, the correlation between IBU and ABV is 0.757878, with a p-value < 2e-16, Multiple R-squared is 0.5744, meaning that 57% changes in exploratory variable ABV can explain the changes in response variable IBU.

## Warning: plotly.js does not (yet) support horizontal legend items 
## You can track progress here: 
## https://github.com/plotly/plotly.js/issues/53
## 
##  Pearson's product-moment correlation
## 
## data:  mergeDF_tidy$IBU and mergeDF_tidy$ABV
## t = 33.863, df = 1403, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6407982 0.6984238
## sample estimates:
##       cor 
## 0.6706215
## 
## Call:
## lm(formula = mergeDF_tidy$IBU ~ mergeDF_tidy$ABV)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -78.849 -11.977  -0.721  13.997  93.458 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -34.099      2.326  -14.66   <2e-16 ***
## mergeDF_tidy$ABV 1282.037     37.860   33.86   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.26 on 1403 degrees of freedom
## Multiple R-squared:  0.4497, Adjusted R-squared:  0.4493 
## F-statistic:  1147 on 1 and 1403 DF,  p-value: < 2.2e-16

## 
##  Pearson's product-moment correlation
## 
## data:  tidyDF$IBU and tidyDF$ABV
## t = 57.005, df = 2408, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7403535 0.7743731
## sample estimates:
##      cor 
## 0.757878
## 
## Call:
## lm(formula = tidyDF$IBU ~ tidyDF$ABV)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -78.718  -4.712  -0.035   2.762  93.442 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -33.818      1.372  -24.65   <2e-16 ***
## tidyDF$ABV  1277.567     22.411   57.01   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.72 on 2408 degrees of freedom
## Multiple R-squared:  0.5744, Adjusted R-squared:  0.5742 
## F-statistic:  3250 on 1 and 2408 DF,  p-value: < 2.2e-16

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 1005 rows containing non-finite values (stat_smooth).
## Warning: Removed 1005 rows containing missing values (geom_point).

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

8. Budweiser would also like to investigate the difference with respect to IBU and ABV between IPAs (India Pale Ales) and other types of Ale (any beer with “Ale” in its name other than IPA). You decide to use KNN clustering to investigate this relationship. Provide statistical evidence one way or the other. You can of course assume your audience is comfortable with percentages … KNN is very easy to understand.

A category variable IPA_Ale is created to split all the 2410 beers into three groups: Ale (1007), IPA (571) and Other (832), we only use the first two groups for analysis in this part. With this KNN classification model, we get an estimated accuracy of 0.7893 with k=5.

## 
##   Ale   IPA Other 
##  1007   571   832

##                 
## classifications5 Ale IPA
##              Ale 199  41
##              IPA  42 112
## Confusion Matrix and Statistics
## 
##                 
## classifications5 Ale IPA
##              Ale 199  41
##              IPA  42 112
##                                           
##                Accuracy : 0.7893          
##                  95% CI : (0.7457, 0.8286)
##     No Information Rate : 0.6117          
##     P-Value [Acc > NIR] : 3.385e-14       
##                                           
##                   Kappa : 0.5571          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.8257          
##             Specificity : 0.7320          
##          Pos Pred Value : 0.8292          
##          Neg Pred Value : 0.7273          
##              Prevalence : 0.6117          
##          Detection Rate : 0.5051          
##    Detection Prevalence : 0.6091          
##       Balanced Accuracy : 0.7789          
##                                           
##        'Positive' Class : Ale             
## 

##                 
## classifications5 Ale IPA
##              Ale 182  35
##              IPA  25  74
## Confusion Matrix and Statistics
## 
##                 
## classifications5 Ale IPA
##              Ale 182  35
##              IPA  25  74
##                                           
##                Accuracy : 0.8101          
##                  95% CI : (0.7625, 0.8519)
##     No Information Rate : 0.6551          
##     P-Value [Acc > NIR] : 8.741e-10       
##                                           
##                   Kappa : 0.5705          
##                                           
##  Mcnemar's Test P-Value : 0.2453          
##                                           
##             Sensitivity : 0.8792          
##             Specificity : 0.6789          
##          Pos Pred Value : 0.8387          
##          Neg Pred Value : 0.7475          
##              Prevalence : 0.6551          
##          Detection Rate : 0.5759          
##    Detection Prevalence : 0.6867          
##       Balanced Accuracy : 0.7791          
##                                           
##        'Positive' Class : Ale             
## 

9. Knock their socks off! Find one other useful inference from the data that you feel Budweiser may be able to find value in. You must convince them why it is important and back up your conviction with appropriate statistical evidence.

# Analyze External Datasource AlcoholDF
  AlcoholDF$region <- tolower(AlcoholDF$State)
# AlcoholDF$NumTot <- as.numeric(AlcoholDF$Total)
  AlcoholDF$NumTot <- as.numeric(AlcoholDF$GreaterThan12Est)
  AlcoholDF$chaTot<- as.character(AlcoholDF$NumTot)
  AlcoholMap.df <- left_join(states,AlcoholDF, by = "region")
  
  # Top10 cities latitude and longitude
  LosAngeles <- tibble(long = c(-118.243683), lat = c(34.052235), names = c("CA\n7726"))
  Austin <- tibble(long = c(-97.743057), lat = c(30.267153), names = c("TX\n5302"))
  Orlando <- tibble(long = c(-81.379234), lat = c(28.538336), names = c("FL\n4370"))
  NewYork <- tibble(long = c(-76.147423), lat = c(43.048122), names = c("NY\n4201"))
  Chicago <- tibble(long = c(-87.629799), lat = c(41.878113), names = c("IL\n2949"))
  Philadelphia <- tibble(long = c(-75.165222), lat = c(39.952583), names = c("PA\n2909"))
  Columbus <- tibble(long = c(-82.998795), lat = c(39.961178), names = c("OH\n2425"))
  Boston <- tibble(long = c(-71.058884), lat = c(42.360081), names = c("MA\n1851"))
  Trenton <- tibble(long = c(-74.759720), lat = c(40.220581), names = c("NJ\n1836"))
  Atlanta <- tibble(long = c(-84.387985), lat = c(33.748997), names = c("GA\n1757"))
  Raleigh <- tibble(long = c(-78.638176), lat = c(35.779591), names = c("NC\n1728"))
  
  
  p2 <- ggplot(AlcoholMap.df, aes(long, lat)) +
    geom_polygon(aes(fill = GreaterThan12Est, group=group), stat= "identity", color = "white") +
    #geom_text(aes(label=state), hjust=0.5, vjust=0.5, size=3)  
    #geom_text(aes(label = BrewriesMPD$count, )) +
    scale_fill_gradientn (colours= (heat.colors(3)),na.value="grey90") +
    ggtitle( "Alcohol Abuse by State", subtitle = "Geopositional Analysis") +
    coord_quickmap()
  p2 + geom_text(data = LosAngeles, aes(x = long, y = lat, label = names), hjust = 0.5, vjust=-0.1,nudge_x = 0, color = "black", size=3) +
    geom_text(data = Austin, aes(x = long, y = lat, label = names), hjust = 0.8, vjust=-0.1, nudge_x = 0, color = "black", size=3) +
    geom_text(data = Orlando, aes(x = long, y = lat, label = names), hjust = 0.5, nudge_x = 0, color = "black", size=3) +
    geom_text(data = NewYork, aes(x = long, y = lat, label = names), hjust = 0.5, vjust=-0.2, nudge_x = 0, color = "black", size=3) +
    geom_text(data = Chicago, aes(x = long, y = lat, label = names), hjust = 1, vjust=1, nudge_x = 0, color = "black", size=3) +
    geom_text(data = Philadelphia, aes(x = long, y = lat, label = names), hjust = 1, vjust=-0.1,nudge_x = 0, color = "black", size=3) +
    geom_text(data = Columbus, aes(x = long, y = lat, label = names), hjust = 0, nudge_x = 0, color = "black", size=3) +
    geom_text(data = Boston, aes(x = long, y = lat, label = names), hjust = 0, nudge_x = 0, color = "black", size=3) +
    geom_text(data = Trenton, aes(x = long, y = lat, label = names), hjust = 0, nudge_x = 0, color = "black", size=3) +
    geom_text(data = Atlanta, aes(x = long, y = lat, label = names), hjust = 0, nudge_x = 0, color = "black", size=3) +
    geom_text(data = Raleigh, aes(x = long, y = lat, label = names), hjust = 0, nudge_x = 0, color = "black", size=3)

Summary

Through our analysis:…